Making room for the modeler in data mining

Author

  • Robert A. Stine
Abstract

I would like to use this perspective to advocate further research in data mining toward methods that allow and formalize greater input from the modeler. By ‘modeler’, I mean the domain expert, scientist, business analyst, or researcher who has questions to answer using data. Soliciting input from such users is atypical of most data mining algorithms (e.g., see the introductory discussion in ref. [1]). Although excluding pesky users enables one to enforce protections against, say, overfitting, those same users may come to resent the exclusion. Many fields remain deeply suspicious of black-box models that generate predictions without asking for help or supplying explanations. Handpicked regression models remain the model of choice in the social sciences, for instance, because these allow the user to pick the features and specify the relationships. Regression allows the model to express a theory about how the world works, and the success of the model depends on this specification. To this community, the fact that we can automate the process of selecting features or supply black-box predictions that are more accurate than those of regression is secondary to the absence of interpretative hooks.

Allowing the user to customize a statistical model introduces any number of problems. With practice and a bit of collinearity in the data, one can mold a regression model to support a variety of theories by iteratively adding and removing variables. We know how to keep an automated search for features from overfitting; the challenge is to involve users in this process. Modelers who invest years gathering data are not eager to lay down their favorite tools unless we provide them with alternatives that respect their knowledge of the context and their desire to influence the analysis. A modeler who has a better theory and understanding of the data should be able to get a better model than one who lacks these insights. Black-box models remove the competitive advantage: both modelers pour the data into the model and pop out the same predictions. The methodologies that I want to encourage reward the modeler who has useful insights.

I have a couple of illustrative examples in mind. First, consider the grouped LASSO [2] in the classical regression setting: for a sample of n observations, we observe a response y and k explanatory features in the vector x. The goal is a precise estimate of E y = x′β. The original LASSO provides a method for identifying useful features. Rather than relying on stepwise or heuristic searches, LASSO selects the coefficient vector β̂ that solves
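
The text breaks off before the criterion is stated. As a hedged sketch only, the LASSO and the grouped LASSO are usually written in the textbook forms below. The design matrix X with rows x_i′, the tuning parameter λ, the group index g, the number of groups G, and the group sizes p_g are notation assumed here rather than taken from the source, and the author's scaling of the terms may differ.

```latex
\begin{align*}
% Standard LASSO: squared-error loss plus an l1 penalty on the coefficients
\hat{\beta}_{\text{lasso}}
  &= \arg\min_{\beta \in \mathbb{R}^{k}}
     \sum_{i=1}^{n} \bigl(y_i - x_i'\beta\bigr)^2
     + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert,
     \qquad \lambda \ge 0 \\
% Grouped LASSO (commonly cited form): features are partitioned in advance
% into G groups, with X_g the columns of X in group g and p_g its size
\hat{\beta}_{\text{group}}
  &= \arg\min_{\beta}
     \Bigl\lVert y - \sum_{g=1}^{G} X_g \beta_g \Bigr\rVert_2^2
     + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_g \rVert_2
\end{align*}
```

In this form the ℓ2 norm of each group is not squared, so the penalty tends to set all coefficients in a group to zero together. The partition into groups is supplied before fitting, which is one place where a modeler's knowledge of the context can enter the procedure.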



Journal title:
  • Statistical Analysis and Data Mining

Volume 5

Publication date 2012